For this exercise, we will use a mass cytometry (CyTOF) dataset from Bodenmiller et al. (2012), consisting of 8 paired samples (16 samples) of stimulated (BCR-XL) and unstimulated peripheral blood mononuclear cells (PBMCs) from healthy individuals. The dataset has been made available through Biocondcutor’s ExperimentHub and can be loaded into R as follows:
ExperimentHub package.eh with the ExperimentHub() function.query()to retrieve any records that match the keyword “Bodenmiller”.eh[[<id>]].library(ExperimentHub)
eh <- ExperimentHub()
q <- query(eh, "Bodenmiller")
fs <- eh[["EH2255"]]
## Warning: package 'S4Vectors' was built under R version 4.1.3
SingleCellExperimentNext, we want to construct an object of Bioconductor’s SingleCellExperiment (SCE) class. Herefor, we will first load the experimental panel (containing metadata on each marker) and metadata table (containing metadata on each sample).
# download panel & metadata
url <- "http://imlspenticton.uzh.ch/robinson_lab/cytofWorkflow"
fns <- list(panel = "PBMC8_panel_v3.xlsx" , md = "PBMC8_metadata.xlsx")
for (fn in fns) download.file(file.path(url, fn), destfile = fn, mode = "wb")
# load into R
library(readxl)
panel <- read_excel(fns$panel)
md <- read_excel(fns$md)
prepData() to construct a SCE from the flowSet,library(CATALYST)
sce <- prepData(fs, panel, md)
CATALYST provides a simple wrapper to i) perform high resolution FlowSOM clustering; and, ii) lower resolution ConsensusClusterPlus metaclustering. Here, the data will be initially clustered into xdim x ydim groups. Secondly, the function will metacluster populations into 2 through maxK clusters.
Clusterings are stored under cluster_codes(.), and cluster IDs for a given clustering “x” can be accessed via cluster_ids(., "x") where x is a column name of cluster_codes.
sce$sample_id) assigned to each of the 5 metaclusters (meta5).meta8) using plotAbundances();by and choose the one that is easier to interpret.sce <- cluster(sce, xdim = 8, ydim = 8, maxK = 10, verbose = FALSE)
table(sce$sample_id, cluster_ids(sce, "meta5"))
##
## 1 2 3 4 5
## BCRXL1 276 750 6 1666 140
## BCRXL2 2363 5912 161 7842 397
## BCRXL3 2960 3883 209 4453 747
## BCRXL4 2301 2801 155 3295 438
## BCRXL5 1658 3124 191 3112 458
## BCRXL6 2037 3761 147 2251 426
## BCRXL7 2563 3871 210 7440 686
## BCRXL8 2760 3597 169 4641 486
## Ref1 391 1167 9 993 179
## Ref2 3623 7596 189 4642 675
## Ref3 2741 3106 149 2382 1056
## Ref4 2559 2002 102 1841 402
## Ref5 3757 4459 216 2771 759
## Ref6 2939 5035 126 2113 825
## Ref7 4394 4237 176 6119 1048
## Ref8 3729 4887 149 4165 740
plotAbundances(sce, k = "meta8", by = "cluster_id")
The methodology presented here is based on a dichotomy of markers into “type” markers (used for clustering cells into subpopulations) and “state” markers (tested for differences in expression across conditions). The set of type and state markers defined in the SCE can be viewed via type/state_markers(.).
This dichotomy is based on the assumption that type marker expressions are fairly consistent across experimental groups, i.e., stimulation does not alter cell type definitions. When breaking this assumption, there’s a risk that identified clusters don’t represent true subpopulations, but rather a group of cells whose expression has changed in one condition.
plotMultiHeatmap() with hm1 = FALSE, plot a series of heatmapstype_markers(sce)
## [1] "CD3" "CD45" "CD4" "CD20" "CD33" "CD123" "CD14" "IgM"
## [9] "HLA-DR" "CD7"
state_markers(sce)
## [1] "pNFkB" "pp38" "pStat5" "pAkt" "pStat1" "pSHP2" "pZap70" "pStat3"
## [9] "pSlp76" "pBtk" "pPlcg2" "pErk" "pLat" "pS6"
# e.g., CD7 only expressed in BCRXL samples for most clusters
plotMultiHeatmap(sce,
k = "meta8",
hm1 = FALSE,
hm2 = type_markers(sce))
One of the most popular ways to visualize single cell data is via dimensionality reduction (DR), where each cell is projected into a lower, usually two-dimensional, space. In this dataset, each of N cells initially is represented as a vector of M measurements (for M markers); in the dimension reduction, this \(N \times M\) matrix will be compressed into an \(N \times 2\) or \(N \times 3\) matrix for visualization, with the hope that the 3 dimensions retains most of the dominant signal in the data. Here, we will use a nonlinear dimensionality reduction technique, uniform manifold approximation and projection (UMAP).
When coloring DR plots by, e.g., sample/patient ID, cells should be spread homogeneously, while grouping of cells from the same sample/patient is indicative of batch/patient effects. When coloring by experimental condition, we expect shifts in the 2D-projection of cells when the condition has a global effect on expression, and no shift if it does not.
runDR(), compute a UMAP embedding of at most 500 cells per sample.plotDR(), plot UMAP embeddings colored by patient_id and condition.sce <- runDR(sce, dr = "UMAP", cells = 500)
# no visible patient effect; cells mix more or less
# homogeneously when colored by patient
plotDR(sce, color_by = "patient_id")
# slight shift between cells from different treatment groups
plotDR(sce, color_by = "condition")
This exercise is advanced and covers material that has not been discussed in lectures. If you have time and interest, you are welcome to try and answer this question.
pbMDS() renders a multi-dimensional scaling (MDS) plot on median expression values. Such a plot will give a sense of similarities between samples in an unsupervised way and of key difference in expression before conducting any formal testing.
condition and labelled by patient_id.# 1st dim. separates by condition, 2nd by patient
# e.g., patients 5,7,8 and 1,2,3 group together
pbMDS(sce, color_by = "condition", label_by = "patient_id")
plotMultiHeatmap() to visualize the set of state markers across the 8 metaclusters.# e.g., pNFkB up in Ref, pS6 up in BCRXL, pErk unaffected
plotMultiHeatmap(sce,
k = "meta8",
hm1 = FALSE,
hm2 = state_markers(sce))
DS analysis aims at identifying markers whose expression is differential across conditions, within each subpopulation. For differential testing, we will use methods implemented in the diffcyt package. The diffcyt() wrapper function returns a list containing differential testing results under rowData(.$res).
To present how accounting for the between patient variability with the mixed model increases sensitivity, we will fit both a regular linear model, as well as a linear mixed model with a random intercept for each patient.
# load diffcyt
library(diffcyt)
# extract experimental design table
ei <- metadata(sce)$experiment_info
# create model formulas with & without patient effect
ds_formula1 <- createFormula(ei, "condition", "patient_id")
ds_formula2 <- createFormula(ei, "condition")
# create contrast matrix
contrast <- createContrast(c(0,1))
# run diffcyt 2x
res_ds1 <- diffcyt(sce, ei,
formula = ds_formula1, contrast = contrast,
analysis_type = "DS", method_DS = "diffcyt-DS-LMM",
clustering_to_use = "meta8")
res_ds2 <- diffcyt(sce, ei,
formula = ds_formula2, contrast = contrast,
analysis_type = "DS", method_DS = "diffcyt-DS-LMM",
clustering_to_use = "meta8")
# extract results
tbl_ds1 <- rowData(res_ds1$res)
tbl_ds2 <- rowData(res_ds2$res)
# there are 8*14 = 112 tests:
# each state marker is tested in each cluster
8*length(state_markers(sce)) == nrow(tbl_ds1)
## [1] TRUE
# accounting for patient effect increases sensitivity:
# ~90 vs. 40 cluster-marker instances called differential
fdr <- 0.05
table(tbl_ds1$p_adj < 0.05)
##
## FALSE TRUE
## 27 85
table(tbl_ds2$p_adj < 0.05)
##
## FALSE TRUE
## 73 39
plotDiffHeatmap(), visualize the top 50 hits for one of the result tables above.facet_by.# induced: pS6; repressed: pBtk, pNFkB
tbl <- rowData(res_ds1$res)
plotDiffHeatmap(sce, tbl, top_n = 50)
plotDR(sce,
dr = "UMAP", facet_by = "condition",
color_by = c("pS6", "pBtk", "pNFkB"))